scafari_vignette

Introduction

Single-cell DNA sequencing (scDNA-seq) enables the analysis of mutation profiles at single-cell resolution. This cutting-edge technique makes it possible to shed light on the previously inaccessible complexity of heterogeneous tissues such as tumors.

Requirements

To run scafari, you need R (Version 4.4 or higher).

Running scafari

Installation

scafari is available through Bioconductor. In R, enter the following commands to install scafari:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("scafari")

A development version of scafari will be available on GitHub and can be installed as follows:

install.packages("devtools")
#> Installing package into '/usr/local/lib/R/site-library'
#> (as 'lib' is unspecified)
devtools::install_github("sophiewind/scafari")
#> Using GitHub PAT from the git credential store.
#> Downloading GitHub repo sophiewind/scafari@HEAD
#> Skipping 41 packages ahead of CRAN: zlibbioc, XVector, SparseArray, S4Arrays, IRanges, S4Vectors, MatrixGenerics, BiocGenerics, GenomeInfoDbData, GenomeInfoDb, DelayedArray, Biobase, GenomicRanges, Biostrings, KEGGREST, BiocFileCache, GenomicAlignments, BiocParallel, biomaRt, BiocIO, Rhtslib, GenomicFeatures, BSgenome, rtracklayer, AnnotationDbi, Rsamtools, SummarizedExperiment, ProtGenerics, AnnotationFilter, ensembldb, VariantAnnotation, Rhdf5lib, bamsignals, biovizBase, regioneR, rhdf5filters, SingleCellExperiment, org.Hs.eg.db, karyoploteR, ComplexHeatmap, rhdf5
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#>      checking for file ‘/tmp/RtmppzZQU2/remotese2659c92807/sophiewind-scafari-ec38894/DESCRIPTION’ ...  ✔  checking for file ‘/tmp/RtmppzZQU2/remotese2659c92807/sophiewind-scafari-ec38894/DESCRIPTION’
#>   ─  preparing ‘scafari’:
#>    checking DESCRIPTION meta-information ...  ✔  checking DESCRIPTION meta-information
#>   ─  checking for LF line-endings in source and make files and shell scripts
#>   ─  checking for empty or unneeded directories
#>      Omitted ‘LazyData’ from DESCRIPTION
#>        NB: this package now depends on R (>= 3.5.0)
#>        WARNING: Added dependency on R >= 3.5.0 because serialized objects in
#>      serialize/load version 3 cannot be read in older versions of R.
#>      File(s) containing such objects:
#>        ‘scafari/input/ensembl_37_snp.rds’
#>   ─  building ‘scafari_0.99.0.tar.gz’
#>      
#> 
#> Installing package into '/usr/local/lib/R/site-library'
#> (as 'lib' is unspecified)

Getting started

A basic workflow analysis of scDNA-seq data is demonstrated in the following on an exemplary dataset. First we need to load scafari.

library(scafari)
#> 

Once the package is installed and have been loaded, the analysis can start. scafari is available as R package and shiny GUI. You can run the shiny version by RStudio executing the function TODO(). The scafari worklow can be split into four main parts.

  1. Sequencing information and quality control
  2. Panel analysis
  3. Variant analysis
  4. Analyses of variants of special interest

These parts are represented in the four tabs of the shiny app and several functions in the R package.

Sequencing information and quality control

The analysis start with uploading the .h5 file. In the shiny app an interactive upload is provided on the starting page. Information from the .h5 file are loaded in an sce class object in this step.

suppressPackageStartupMessages(library(SingleCellExperiment))

# Load .h5 file
h5 <- h5ToSce('../testdata/demo.h5')
sce <- h5$sce_amp
se.var <- h5$se_var

After successful data upload the metadata including sequencing information can be accessed and basic quality information can be inspected.

# Get metadata
sce@metadata
#> $ado_rate
#> [1] "NA"
#> 
#> $avg_mapping_error_rate
#> [1] "0.006751065"
#> 
#> $avg_panel_uniformity
#> [1] "0.960629921259842"
#> 
#> $chemistry_version
#> [1] "V2"
#> 
#> $genome_version
#> [1] "hg19"
#> 
#> $n_amplicons
#> [1] "127"
#> 
#> $n_bases_r1
#> [1] "1313531484"
#> 
#> $n_bases_r1_q30
#> [1] "1257088621"
#> 
#> $n_bases_r2
#> [1] "1313531484"
#> 
#> $n_bases_r2_q30
#> [1] "1191181775"
#> 
#> $n_cell_barcode_bases
#> [1] "433072697"
#> 
#> $n_cell_barcode_bases_q30
#> [1] "433315642"
#> 
#> $n_cells
#> [1] "1313"
#> 
#> $n_read_pairs
#> [1] "8698884"
#> 
#> $n_read_pairs_mapped_to_cells
#> [1] "5417202"
#> 
#> $n_read_pairs_trimmed
#> [1] "8573237"
#> 
#> $n_read_pairs_valid_cell_barcodes
#> [1] "8419726"
#> 
#> $n_reads_mapped
#> [1] "16453845"
#> 
#> $n_reads_mapped_insert
#> [1] "13777334"
#> 
#> $panel_name
#> [1] "AML"
#> 
#> $pipeline_version
#> [1] "2.0.2"

Further, reads per TODO barcode can be visualized in a log-log plot.

logLogPlot(sce)

Panel analysis

The panel analysis start by normalizing the read counts using normalizeReadCounts(). The normalized read counts are stored in the corresponding assay (normalized.counts).

sce <- normalizeReadCounts(sce = sce)

After read counts are normalized they can be annotated with biological relevant information such as exon number and canonical transcript ID.

annotateAmplicons(sce = sce)

The exact genomic locations of the amplicons can be inspected in the amplicon disribution plot.

plotAmpliconDistribution(sce)

The amplicon performance plot depicts normalized read counts per amplicon. This can help to identify low performing amplicons and get deeper insights into potential copy number alterations (Sarah).

plotNormalizedReadCounts(sce = sce)

Insights in the normalized read counts per amplicon on single cell resolution is provided in the panel uniformity plot. Each dot represenents one cell and each row an amplicon. The amplicons are sorted by their mean reads, from high to low. Amplicons meeting at least 20% of the average depth of coverage are represented in blue, others in orange.

plotPanelUniformity(sce, interactive = FALSE)
#> No id variables; using all as measure variables

Variant analysis

The variant analysis starts with variant filtering and is performed on criteria such as sequencing depth, variant allele frequency (VAF), genoytpe data and genotype quality data. Further, the minimum percentage of cells that could be assigned with a genotype other than missing (wild type, heterozygous, homozygous) and the minimum percentage of cells with mutated genotypes (heterozygous, homozygous).

suppressPackageStartupMessages(library(SummarizedExperiment))

filteres <- filterVariants(depth.threshold = 10,
                           genotype.quality.threshold = 30,
                           vaf.ref = 5, 
                           vaf.het =  35, 
                           vaf.hom = 95, 
                           min.cell = 50,
                           min.mut.cell = 1,
                           se.var = se.var,
                           sce = sce,
                           shiny = FALSE)
#> start variant filtering...
se.f <- SummarizedExperiment(assays = list(VAF = t(filteres$vaf.matrix.filtered), 
                                           Genotype = t(filteres$genotype.matrix.filtered),
                                           Genoqual = t(filteres$genoqual.matrix.filtered)),
                              rowData = filteres$variant.ids.filtered,
                              colData = filteres$cells.keep)
    
# Filter out cells in sce object
# Find the indices of the columns to keep
indices_to_keep <- match(filteres$cells.keep, SummarizedExperiment::colData(sce)[[1]], nomatch = 0)
    
# Subset the SCE using these indices
sce_filtered <- sce[, indices_to_keep]
SingleCellExperiment::altExp(sce_filtered, "variants") <- se.f

After filtering the variants are annotated with biological information like gene names and clinical variant metrics. By default, annotation is performed via the Mission Bio API, which provides information on the names of the gene and the affected protein, the coding effect, the functional classification (e.g. intronic, coding), ClinVar information, the dbSNP ID and the DANN (Deleterious Annotation of genetic variants using Neural Networks) score. The annotated variants can than be inspected.

suppressPackageStartupMessages(library(DT))
suppressPackageStartupMessages(library(dplyr))

# Load sce object with filtered variants
# Check metadata 
metadata(altExp(sce_filtered))
#> list()

# Annotate variants
sce_filtered <- annotateVariants(sce = sce_filtered)

# Check metadata. Now a annotation flag is present
metadata(altExp(sce_filtered))
#> $annotated
#> [1] TRUE

# The annotations are stored in the rowData
rowData(altExp(sce_filtered)) %>% 
  as.data.frame() %>% 
  head() %>% 
  datatable()

To gain deeper insights into the VAF of filtered variants across the cells, a VAF heatmap with genotype and chromosome annotations is available.

plotVariantHeatmap(sce = sce_filtered)

More detailed insights into the genotype and their quality can be obtained in a violin plot.

plotGenotypequalityPerGenotype(sce = sce_filtered)

Analysis of variants of interest

To further focus on selected variants of interest, there is a set of function to analyze user selected variants. Thereby, variants forming cell clusters/clones can be identified and analyzed regarding VAF and genotype. First the user need to define a set of variants of interest.

variants.of.interest <- c("KIT:chr4:55599436:T/C", 
                          "TET2:chr4:106158216:G/A", 
                          "FLT3:chr13:28610183:A/G",
                          "TP53:chr17:7577427:G/A")

After selecting the variants of interest, the cells can be clustered based on their VAF. The clustering returns the k-means results and the clusterplot itself in a list object.

plot <- clusterVariantSelection(sce = sce_filtered,
                        variants.of.interest = variants.of.interest,
                        n.clust = 3)

names(plot)
#> [1] "k_means"     "clusterplot"

The clustered cells plot is stored in clusterplot.

plot$clusterplot

To compare the clusters regarding the variant selection there are various visualizations available. The distribution of variant allele frequency (VAF) within each cell cluster can be obtained using plotClusterVAF().

plotClusterVAF(sce = sce_filtered, 
               variants.of.interest = variants.of.interest,
               gg.clust = plot$clusterplot)

plotClusterVAFMap(sce = sce_filtered, 
               variants.of.interest = variants.of.interest,
               gg.clust = plot$clusterplot)

Further, the clusters can be analyzed regarding the genotype using plotClusterGenotype().

plotClusterGenotype(sce = sce_filtered, 
               variants.of.interest = variants.of.interest,
               gg.clust = plot$clusterplot)
#> Using cluster as id variables

Data

Session information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] dplyr_1.1.4                 DT_0.33                    
#>  [3] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
#>  [5] Biobase_2.64.0              GenomicRanges_1.56.2       
#>  [7] GenomeInfoDb_1.40.1         IRanges_2.38.1             
#>  [9] S4Vectors_0.42.1            BiocGenerics_0.50.0        
#> [11] MatrixGenerics_1.16.0       matrixStats_1.5.0          
#> [13] scafari_0.99.0             
#> 
#> loaded via a namespace (and not attached):
#>   [1] later_1.4.2              BiocIO_1.14.0            filelock_1.0.3          
#>   [4] bitops_1.0-9             tibble_3.2.1             XML_3.99-0.18           
#>   [7] rpart_4.1.23             httr2_1.0.1              factoextra_1.0.7        
#>  [10] karyoploteR_1.30.0       lifecycle_1.0.4          rstatix_0.7.2           
#>  [13] doParallel_1.0.17        processx_3.8.4           lattice_0.22-6          
#>  [16] ensembldb_2.28.1         crosstalk_1.2.1          backports_1.5.0         
#>  [19] magrittr_2.0.3           Hmisc_5.2-3              plotly_4.10.4           
#>  [22] sass_0.4.10              rmarkdown_2.29           jquerylib_0.1.4         
#>  [25] yaml_2.3.10              remotes_2.5.0            shinyBS_0.61.1          
#>  [28] httpuv_1.6.16            sessioninfo_1.2.2        pkgbuild_1.4.4          
#>  [31] shinycustomloader_0.9.0  DBI_1.2.3                RColorBrewer_1.1-3      
#>  [34] abind_1.4-8              pkgload_1.3.4            zlibbioc_1.50.0         
#>  [37] purrr_1.0.4              AnnotationFilter_1.28.0  biovizBase_1.52.0       
#>  [40] RCurl_1.98-1.17          nnet_7.3-19              rappdirs_0.3.3          
#>  [43] VariantAnnotation_1.50.0 circlize_0.4.16          GenomeInfoDbData_1.2.12 
#>  [46] ggrepel_0.9.6            codetools_0.2-20         DelayedArray_0.30.1     
#>  [49] xml2_1.3.8               tidyselect_1.2.1         shape_1.4.6.1           
#>  [52] UCSC.utils_1.0.0         farver_2.1.2             BiocFileCache_2.12.0    
#>  [55] base64enc_0.1-3          bamsignals_1.36.0        GenomicAlignments_1.40.0
#>  [58] jsonlite_2.0.0           GetoptLong_1.0.5         ellipsis_0.3.2          
#>  [61] waiter_0.2.5             Formula_1.2-5            iterators_1.0.14        
#>  [64] foreach_1.5.2            progress_1.2.3           tools_4.4.1             
#>  [67] Rcpp_1.0.14              glue_1.8.0               gridExtra_2.3           
#>  [70] SparseArray_1.4.8        xfun_0.52                usethis_2.2.3           
#>  [73] withr_3.0.2              fastmap_1.2.0            rhdf5filters_1.16.0     
#>  [76] shinyjs_2.1.0            callr_3.7.6              digest_0.6.37           
#>  [79] R6_2.6.1                 mime_0.13                colorspace_2.1-1        
#>  [82] biomaRt_2.60.1           dichromat_2.0-0.1        markdown_2.0            
#>  [85] RSQLite_2.3.11           tidyr_1.3.1              generics_0.1.3          
#>  [88] data.table_1.17.0        rtracklayer_1.64.0       prettyunits_1.2.0       
#>  [91] httr_1.4.7               htmlwidgets_1.6.4        S4Arrays_1.4.1          
#>  [94] regioneR_1.36.0          pkgconfig_2.0.3          gtable_0.3.6            
#>  [97] blob_1.2.4               ComplexHeatmap_2.20.0    XVector_0.44.0          
#> [100] htmltools_0.5.8.1        carData_3.0-5            profvis_0.3.8           
#> [103] ProtGenerics_1.36.0      clue_0.3-66              scales_1.4.0            
#> [106] png_0.1-8                knitr_1.50               rstudioapi_0.17.1       
#> [109] reshape2_1.4.4           rjson_0.2.23             checkmate_2.3.2         
#> [112] curl_6.2.2               org.Hs.eg.db_3.19.1      cachem_1.1.0            
#> [115] rhdf5_2.48.0             GlobalOptions_0.1.2      stringr_1.5.1           
#> [118] parallel_4.4.1           miniUI_0.1.1.1           shinycssloaders_1.1.0   
#> [121] foreign_0.8-86           AnnotationDbi_1.66.0     restfulr_0.0.15         
#> [124] desc_1.4.3               pillar_1.10.2            grid_4.4.1              
#> [127] vctrs_0.6.5              ggpubr_0.6.0             urlchecker_1.0.1        
#> [130] promises_1.3.2           car_3.1-3                dbplyr_2.5.0            
#> [133] xtable_1.8-4             cluster_2.1.6            htmlTable_2.4.3         
#> [136] evaluate_1.0.3           GenomicFeatures_1.56.0   cli_3.6.5               
#> [139] compiler_4.4.1           bezier_1.1.2             Rsamtools_2.20.0        
#> [142] rlang_1.1.6              crayon_1.5.3             ggsignif_0.6.4          
#> [145] labeling_0.4.3           ps_1.7.6                 plyr_1.8.9              
#> [148] fs_1.6.6                 stringi_1.8.7            viridisLite_0.4.2       
#> [151] BiocParallel_1.38.0      Biostrings_2.72.1        lazyeval_0.2.2          
#> [154] devtools_2.4.5           Matrix_1.7-0             BSgenome_1.72.0         
#> [157] hms_1.1.3                bit64_4.6.0-1            ggplot2_3.5.2           
#> [160] Rhdf5lib_1.26.0          KEGGREST_1.44.1          shiny_1.10.0            
#> [163] broom_1.0.8              memoise_2.0.1            bslib_0.9.0             
#> [166] bit_4.6.0